Reward modeling for mitigating toxicity in transformer-based language models

Abstract

Transformer-based language models can generate fluent text and be efficiently adapted across various natural language generation tasks. However, language models that are pretrained on large unlabeled web text corpora have been shown to degenerate into toxic content and social bias behaviors, consequently hindering their safe deployment. Various detoxification methods have been proposed to mitigate language model toxicity; however, these methods struggle to detoxify language models when conditioned on prompts that contain specific social identities related to gender, race, or religion. In this study, we propose Reinforce-Detoxify, a reinforcement learning-based method for mitigating toxicity in language models. We address the challenge of safety in language models and propose a new reward model that can detect toxic content and mitigate unintended bias towards social identities in toxicity prediction. The experiments demonstrate that Reinforce-Detoxify outperforms existing detoxification approaches on automatic evaluation metrics, indicating that our approach is less prone to unintended bias toward social identities in generated content.
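
The abstract does not give the training objective, so the following is only a toy sketch of the general idea it names: a policy-gradient (REINFORCE) update driven by a toxicity-aware reward. Everything here is an assumption for illustration and not the paper's implementation: the four-word vocabulary, the hand-set `TOXICITY` scores standing in for a learned reward model, the reward form `1 - toxicity`, and the learning rate.

```python
import math
import random

# Hypothetical toy vocabulary and toxicity scores; a real system would use a
# learned reward model and a transformer policy instead of this lookup table.
VOCAB = ["hello", "thanks", "idiot", "friend"]
TOXICITY = {"hello": 0.0, "thanks": 0.0, "idiot": 1.0, "friend": 0.0}


def softmax(logits):
    """Convert logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward(token):
    """Toy reward: higher for less toxic tokens (assumed form)."""
    return 1.0 - TOXICITY[token]


def reinforce_step(logits, rng, lr=0.5):
    """Apply one REINFORCE update based on a single sampled token."""
    probs = softmax(logits)
    idx = rng.choices(range(len(VOCAB)), weights=probs, k=1)[0]
    # Baseline: expected reward under the current policy, to reduce variance.
    baseline = sum(p * reward(t) for p, t in zip(probs, VOCAB))
    advantage = reward(VOCAB[idx]) - baseline
    # The gradient of log pi(idx) w.r.t. logit j is (1[j == idx] - probs[j]).
    return [
        logit + lr * advantage * ((1.0 if j == idx else 0.0) - probs[j])
        for j, logit in enumerate(logits)
    ]


def train(steps=500, seed=0):
    """Run the toy policy from uniform logits and return its final probabilities."""
    rng = random.Random(seed)
    logits = [0.0] * len(VOCAB)
    for _ in range(steps):
        logits = reinforce_step(logits, rng)
    return softmax(logits)
```

Calling `train()` returns the final token probabilities. The policy starts uniform, so the toxic token begins at probability 0.25, and its logit is pushed down on every update. Using the expected reward as a baseline is one common variance-reduction choice, not something the abstract specifies.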

Similar resources

Power Transformer Modeling for Inrush Current Calculation

The paper documents a new transformer model in ATPDraw called XFMR. This model handles 3-phase transformers with two or three windings. Autotransformers and all Wye and Delta couplings are supported. The model includes an inverse inductance matrix for the leakage description, optional frequency dependent winding resistance, capacitive coupling, and a topologically correct core model (3- and 5-leg...

Modeling Quantum Entanglements in Quantum Language Models

Recently, a Quantum Language Model (QLM) was proposed to model term dependencies upon Quantum Theory (QT) framework and successively applied in Information Retrieval (IR). Nevertheless, QLM’s dependency is based on co-occurrences of terms and has not yet taken into account the Quantum Entanglement (QE), which is a key quantum concept and has a significant cognitive implication. In QT, an entang...

Transformer Modeling Based on Standard Frequency Response Measurements

High frequency models of large power transformers are required for analysis of transient interaction phenomena between transformers and the power system. Fast transient overvoltages may lead to transformer dielectric failures. Deeper understanding of the mechanisms may help to take actions against possible damages. This paper describes the simulation principle of transient interaction in matter...

A reward-based approach for preference modeling: A case study

Most reasoning for decision making in daily life is based on preferences. As with other kinds of reasoning processes, there are many formalisms that try to capture preferences; however, none of them is able to capture all the subtleties of human reasoning. In this paper we analyse how to formalize the preferences expressed by humans and how to reason with them to produce rankings. Parti...

Journal

Journal title: Applied Intelligence

Year: 2022

ISSN: 0924-669X, 1573-7497

DOI: https://doi.org/10.1007/s10489-022-03944-z